Introduction

# This section is to install package that will be required to proceed with this section of the workshop.

# If you do not have "devtools" installed, uncomment the following line and run on your console
# install.packages("devtools")

# devtools::install_github("SydneyBioX/scdney")

You may visit our package website for the vignette of scdney package.

# We will subset Su et al. and Yang et al. datasets.
ids = which(colData(sce_scMerge)$batch %in% c("GSE87795", "GSE90047"))

lab = colData(sce_scMerge)$cellTypes[ids]

nCs = length(table(lab))

mat = SummarizedExperiment::assay(sce_scMerge, "scMerge")[,ids]

Cell type proportion

par(mar=c(10,7,2,2))
barplot(table(lab), las = 2, ylab = "No. of cells", main = "Cell type proportion")

We observe that hepatoblast/hepatocyte is the largest population.

Visualisation

# create tsne object
set.seed(123)
tsne_result = Rtsne(t(mat), check_duplicates = FALSE)

plot(tsne_result$Y, col = factor(lab) , main = "t-SNE plot (cells coloured by known cell types")

Cell Type identification

The data from single cell RNA-sequencing experiment does not provide cell type information for individual cells. We need to identify cell types of indivdual cells in our data with bioinformatics analysis.

One common method to explore single cell data is by clustering. Clustering is a type of machine learning technique to group similar samples (cell) to the same group and partition samples that are different by comparing their feature information (gene expression). To optimise clustering method, algorithm and similarity metric are two key components that affects clustering performance.

In our recent study, we found pearson correlation to be optimal similarity metric for comparing single cell RNA-seq data ( Kim et al., 2018 ). Therefore in this workshop, we will utilise scClust in our scdney package, which implemented 2017 Nature methods clustering algorithm SIMLR with pearson correlation.

We need to validate the number of clusters (k) in this dataset. there actually are in an unsupervised approach.

scClust (with SIMLR and pearson correlation)

# k = 6
simlr_result_k6 = scClust(mat, 6, similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration:  1 
## Iteration:  2 
## Iteration:  3 
## Iteration:  4 
## Iteration:  5 
## Iteration:  6 
## Iteration:  7 
## Iteration:  8 
## Iteration:  9 
## Iteration:  10 
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  0.1527458 
## Epoch: Iteration # 200  error is:  0.1218641 
## Epoch: Iteration # 300  error is:  0.1112158 
## Epoch: Iteration # 400  error is:  0.1059176 
## Epoch: Iteration # 500  error is:  0.1026344 
## Epoch: Iteration # 600  error is:  0.1003277 
## Epoch: Iteration # 700  error is:  0.0985936 
## Epoch: Iteration # 800  error is:  0.09722343 
## Epoch: Iteration # 900  error is:  0.09610318 
## Epoch: Iteration # 1000  error is:  0.09516611 
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  11.56649 
## Epoch: Iteration # 200  error is:  0.2845774 
## Epoch: Iteration # 300  error is:  0.2560054 
## Epoch: Iteration # 400  error is:  0.2410104 
## Epoch: Iteration # 500  error is:  0.2361371 
## Epoch: Iteration # 600  error is:  0.2331395 
## Epoch: Iteration # 700  error is:  0.231009 
## Epoch: Iteration # 800  error is:  0.229431 
## Epoch: Iteration # 900  error is:  0.2281798 
## Epoch: Iteration # 1000  error is:  0.2271669
# We will not run for various `k` to save time. Instead, we will load pre-computed result
# all_k = 3:8
# simlr.results = sapply(as.character(all_k), function(k) {
#   scClust(mat, as.numeric(k), similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
# }, USE.NAMES = TRUE, simplify = FALSE)


load(paste(getwd(), "/data/simlr.results.RData", sep = ""))

Find optimal k

all_wss = sapply(simlr_results, function(result) {
  sum(result$y$withinss)
}, USE.NAMES = TRUE, simplify = TRUE)

plot(x = names(all_wss), y = all_wss, main = "Total WSS for each k", xlab = "k", ylab = "total WSS")

As shown in this plot, the graph begin to plateau from k = 5. We can estimate that k is around 5 or 6. We can further investigate using silhouette scores or other metrics but using our t-SNE plot (and with a little cheating) we will estimate k = 6.

plot(tsne_result$Y, col = simlr_result_k6$y$cluster, main = "t-SNE plot coloured by cluster output")

Validation

Comparison of our result with cell type information

mclust::adjustedRandIndex(lab, simlr_result_k6$y$cluster)
## [1] 0.5752176
igraph::compare(as.numeric(factor(simlr_result_k6$y$cluster)), as.numeric(factor(lab)), method = "nmi")
## [1] 0.6873722
par(mfrow=c(1,2))
plot(tsne_result$Y, col = factor(lab), main = "t-SNE coloured by cell type information")
plot(tsne_result$Y, col = simlr_result_k6$y$cluster, main = "t-SNE coloured by our result")

Distribution of marker genes

There are 6 cell types in the dataset. We have obtained some marker genes for all the 6 cell types through some literature search. In this section, we will explain how we can visualise the distribution of marker genes.

Endothelial

Firstly, we can colour the t-SNE plot using the provided cell labels. In the figure eblow, Endothelial cells is red and the rest of the cells is black.

We can then colour the t-SNE plot according to expression value of Endothelial cells marker genes.

For example, in the code provided below, we first used which() to find out which row contains Sox17 gene. We then used ifelse() to indicate that, if a cell has Sox17 expression value < 2, then this cell will be coloured red, otherwise it will be coloured green.

The figure below illustrates that most of the cells in the top right cluster is coloured green and almost all the other cells is coloured red. This indicates that cells in the top right cluster are indeed different from the remaining cells in their expression pattern of Sox17. This potentially suggests that these cells belong to endothelial cells.

plot(tsne_result$Y, col = factor(lab == "Endothelial Cell"), main = "Endothelial Cell" )

index <- which(rownames(sce_scMerge[,ids]) == "Sox17")
plot(tsne_result$Y, col = ifelse( mat[2,] < 2 ,'red','green'), main = "Sox17")

index <- which(rownames(sce_scMerge[,ids]) == "Sox18")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Sox18")

index <- which(rownames(sce_scMerge[,ids]) == "Sox7")
#hist(mat[index,])

plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Sox7")

cholangiocyte

We can repeat the procudure on other cell type of interest.

plot(tsne_result$Y, col = factor(lab == "cholangiocyte") , main  = "cholangiocyte")

index <- which(rownames(sce_scMerge[,ids]) == "Sox9")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 6 ,'red','green'), main = "Sox9")

index <- which(rownames(sce_scMerge[,ids]) == "Epcam")
hist(mat[index,])

plot(tsne_result$Y, col = ifelse( mat[index,] < 6 ,'red','green'), main = "Epcam")

index <- which(rownames(sce_scMerge[,ids]) == "Krt19")
#hist(mat[index,])

plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Krt19")

plot(tsne_result$Y, col = factor(lab == "Mesenchymal Cell"), main = "Mesenchymal Cell")

index <- which(rownames(sce_scMerge[,ids]) == "Map2")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] > -1.5 ,'red','green'), main = "Map2")

index <- which(rownames(sce_scMerge[,ids]) == "Gfap")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 0.6 ,'red','green'), main = "Gfap")

hepatoblast/hepatocyte

plot(tsne_result$Y, col = factor(lab == "hepatoblast/hepatocyte"), main = "hepatoblast/hepatocyte" )

index <- which(rownames(sce_scMerge[,ids]) == "Foxa2")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Foxa2")

index <- which(rownames(sce_scMerge[,ids]) == "Hnf4a")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Hnf4a")

Hematopoietic

plot(tsne_result$Y, col = factor(lab == "Hematopoietic"), main = "Hematopoietic" )

index <- which(rownames(sce_scMerge[,ids]) == "Mpl")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] <1.5 ,'red','green'), main = "Mpl")

Immune cell

plot(tsne_result$Y, col = factor(lab == "Immune cell"), main = "Immune cell")

index <- which(rownames(sce_scMerge[,ids]) == "S100a9")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "S100a9")

index <- which(rownames(sce_scMerge[,ids]) == "Fcer1g")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 5 ,'red','green'), main = "Fcer1g")

index <- which(rownames(sce_scMerge[,ids]) == "Cd69")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 5 ,'red','green'), main = "Cd69")

SessionInfo

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cluster_2.0.9               Rtsne_0.15                 
##  [3] mclust_5.4.3                scdney_0.1.2               
##  [5] edgeR_3.26.4                limma_3.40.2               
##  [7] dplyr_0.8.1                 SingleCellExperiment_1.6.0 
##  [9] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
## [11] BiocParallel_1.18.0         matrixStats_0.54.0         
## [13] Biobase_2.44.0              GenomicRanges_1.36.0       
## [15] GenomeInfoDb_1.20.0         IRanges_2.18.0             
## [17] S4Vectors_0.22.0            BiocGenerics_0.30.0        
## 
## loaded via a namespace (and not attached):
##   [1] amap_0.8-17            minqa_1.2.4            colorspace_1.4-1      
##   [4] class_7.3-15           ggridges_0.5.1         htmlTable_1.13.1      
##   [7] XVector_0.24.0         base64enc_0.1-3        rstudioapi_0.10       
##  [10] mice_3.5.0             prodlim_2018.04.18     manipulate_1.0.1      
##  [13] mvtnorm_1.0-10         lubridate_1.7.4        codetools_0.2-16      
##  [16] splines_3.6.0          doParallel_1.0.14      knitr_1.23            
##  [19] Formula_1.2-3          nloptr_1.2.1           caret_6.0-84          
##  [22] clusteval_0.1          broom_0.5.2            compiler_3.6.0        
##  [25] backports_1.1.4        assertthat_0.2.1       Matrix_1.2-17         
##  [28] lazyeval_0.2.2         acepack_1.4.1          htmltools_0.3.6       
##  [31] tools_3.6.0            igraph_1.2.4.1         gtable_0.3.0          
##  [34] glue_1.3.1             GenomeInfoDbData_1.2.1 reshape2_1.4.3        
##  [37] Rcpp_1.0.1             nlme_3.1-140           iterators_1.0.10      
##  [40] timeDate_3043.102      xfun_0.7               gower_0.2.1           
##  [43] stringr_1.4.0          lme4_1.1-21            dendextend_1.12.0     
##  [46] pan_1.6                zlibbioc_1.30.0        MASS_7.3-51.4         
##  [49] scales_1.0.0           ipred_0.9-9            doSNOW_1.0.16         
##  [52] expm_0.999-4           RColorBrewer_1.1-2     yaml_2.2.0            
##  [55] gridExtra_2.3          ggplot2_3.1.1          rpart_4.1-15          
##  [58] latticeExtra_0.6-28    stringi_1.4.3          randomForest_4.6-14   
##  [61] foreach_1.4.4          e1071_1.7-1            checkmate_1.9.3       
##  [64] boot_1.3-22            lava_1.6.5             rlang_0.3.4           
##  [67] pkgconfig_2.0.2        bitops_1.0-6           evaluate_0.14         
##  [70] lattice_0.20-38        purrr_0.3.2            recipes_0.1.5         
##  [73] htmlwidgets_1.3        tidyselect_0.2.5       plyr_1.8.4            
##  [76] magrittr_1.5           R6_2.4.0               snow_0.4-3            
##  [79] DescTools_0.99.28      generics_0.0.2         Hmisc_4.2-0           
##  [82] mitml_0.3-7            pillar_1.4.1           foreign_0.8-71        
##  [85] withr_2.1.2            survival_2.44-1.1      RCurl_1.95-4.12       
##  [88] nnet_7.3-12            tibble_2.1.2           crayon_1.3.4          
##  [91] jomo_2.6-8             rmarkdown_1.13         viridis_0.5.1         
##  [94] locfit_1.5-9.1         grid_3.6.0             data.table_1.12.2     
##  [97] ModelMetrics_1.2.2     digest_0.6.19          tidyr_0.8.3           
## [100] munsell_0.5.0          viridisLite_0.3.0